Mining New Motifs from cDNA Sequence Data
Abstract
General biological databases that store basic information on the genome, transcriptome, and proteome are indispensable resources for sequence discovery. However, they are not necessarily useful for inferring the functions of proteins. To see this, we observe that SWISS-PROT, a protein knowledgebase containing curated protein sequences and functional information on domains and diseases, has grown a mere 32-fold in 17 years, from 3,939 entries in 1986 to 126,147 entries in 2003. Similarly, despite the human draft genome and the mouse draft genome and transcriptome, the number of human and mouse protein sequences with some functional information has remained low, 7,471 (7.4%) for human and 4,816 (4.7%) for mouse, compared to an estimated proteome of 0.5–1.0 sequences. The majority of sequences in the TrEMBL section of SWISS-PROT/TrEMBL, in FANTOM, and in other similar databases are hypothetical proteins or uninformative sequences described as “similar to DKFZ ...” or “weakly similar to KIAA ....” These sequences have no informative homolog that diverged from a common ancestor, and match only non-informative homologs. Algorithms for motif identification are commonly used to classify these sequences and to provide functional clues on binding sites, catalytic sites, and active sites, or on structure/function relations. For example, 5,873 of 21,050 predicted FANTOM1 protein sequences contain InterPro motifs or domains. In fact, the InterPro name is the only functional description of 900 sequences. Extrapolations from current mouse cDNA data indicate that the proteome is significantly larger than the genome. This underlines the importance of exploring protein sequences, motifs, and modules to derive potential functions and interactions for these sequences. Strictly defined, new protein sequence motifs are either conserved sequences of common ancestry or the result of convergence (functional motifs).
Similar resources
WebTraceMiner: a web service for processing and mining EST sequence trace files
Expressed sequence tags (ESTs) remain a dominant approach for characterizing the protein-encoding portions of various genomes. Due to inherent deficiencies, they also present serious challenges for data quality control. Before GenBank submission, EST sequences are typically screened and trimmed of vector and adapter/linker sequences, as well as polyA/T tails. Removal of these sequences presents...
Mining Protein Sequences for Motifs
We use methods from Data Mining and Knowledge Discovery to design an algorithm for detecting motifs in protein sequences. The algorithm assumes that a motif is constituted by the presence of a "good" combination of residues in appropriate locations of the motif. The algorithm attempts to compile such good combinations into a "pattern dictionary" by processing an aligned training set of protein ...
A DNA based Approach to find Closed Repetitive Gapped Subsequences from a Sequence Database
In bioinformatics, the discovery of transcription factor binding affinities is important. This is done by sequence analysis of microarray data. Accurately determining continuous and gapped motifs from a long sequence of data, such as genetic data, is challenging and requires detailed study. In this paper, we propose an algorithm that can be used for finding short continuous, shor...
New Seed Selection Technique for Protein Sequence Motif Identification
Bioinformatics is a field devoted to the interpretation and analysis of biological data using computational techniques. In recent years, the study of bioinformatics has grown tremendously due to the huge amount of biological information generated by the scientific community. Protein sequence motifs are short fragments of conserved amino acids often associated with a specific function. Identifying such...
Sequential Data Mining for Information Extraction from Texts
This paper shows the benefit of using data mining methods for Biological Natural Language Processing. A method for discovering linguistic patterns based on recursive sequential pattern mining is proposed. It requires neither sentence parsing nor any resource other than a training data set. It produces understandable results, and we show its value in the extraction of relations between named...